06. Policy Gradient Quiz
Suppose we are training an agent to play a computer game. There are only two possible actions:
0 = Do nothing,
1 = Move
There are three time-steps in each game, and our policy is completely determined by one parameter \theta, such that the probability of moving is \theta and the probability of doing nothing is 1-\theta.
Initially, \theta=0.5. Three games are played, with the following results:
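Because the policy is a single Bernoulli parameter, the derivative of the log-probability of each action with respect to \theta has a simple closed form: 1/\theta for "move" and -1/(1-\theta) for "do nothing". The following is a minimal sketch of that derivative; the function name grad_log_prob is hypothetical, and \theta = 0.5 is used only as an illustrative value.

```python
def grad_log_prob(action, theta):
    """Derivative of log pi(action | theta) with respect to theta.

    pi(1 | theta) = theta        -> d/dtheta log(theta)     =  1 / theta
    pi(0 | theta) = 1 - theta    -> d/dtheta log(1 - theta) = -1 / (1 - theta)
    """
    if action == 1:
        return 1.0 / theta
    return -1.0 / (1.0 - theta)


theta = 0.5  # illustrative value
print(grad_log_prob(1, theta))  # 2.0
print(grad_log_prob(0, theta))  # -2.0
```

At \theta = 0.5 the two derivatives have equal magnitude and opposite sign, so a "move" step and a "do nothing" step with the same reward weight cancel.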
Game 1:
actions: (1,0,1)
rewards: (1,0,1)
Game 2:
actions: (1,0,0)
rewards: (0,0,1)
Game 3:
actions: (0,1,0)
rewards: (1,0,1)
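A quantity often used when weighting policy gradient terms is the future reward (also called reward-to-go): at each time-step, the sum of the rewards from that step to the end of the game. The sketch below computes it for the three games listed above; the helper name future_rewards is an assumption made for illustration, and no discounting is applied.

```python
def future_rewards(rewards):
    """Return the undiscounted sum of rewards from each step onward."""
    running = 0
    out = []
    # Walk backwards so each step adds its own reward to the tail sum.
    for r in reversed(rewards):
        running += r
        out.append(running)
    return tuple(reversed(out))


game_rewards = {
    "Game 1": (1, 0, 1),
    "Game 2": (0, 0, 1),
    "Game 3": (1, 0, 1),
}

for name, rewards in game_rewards.items():
    print(name, future_rewards(rewards))
# Game 1 (2, 1, 1)
# Game 2 (1, 1, 1)
# Game 3 (2, 1, 1)
```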
Computing the policy gradient
SOLUTION:
(2,1,1)
SOLUTION:
-2
SOLUTION:
- The contributions to the gradient from the second and third steps cancel each other out
- The computed policy gradient from this game is negative
- Using the total reward or the future reward gives the same policy gradient in this game
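These answers can be checked numerically. The sketch below assumes a REINFORCE-style per-game estimate: the sum over time-steps of the derivative of log \pi(a_t) with respect to \theta, multiplied by a reward weight that is either the future reward at that step or the total reward of the game. No averaging over games or discounting is applied. All function names are hypothetical. Under these assumptions, the value -2 matches Game 2, and the three statements in the last solution all hold for Game 3.

```python
THETA = 0.5

GAMES = {
    "Game 1": {"actions": (1, 0, 1), "rewards": (1, 0, 1)},
    "Game 2": {"actions": (1, 0, 0), "rewards": (0, 0, 1)},
    "Game 3": {"actions": (0, 1, 0), "rewards": (1, 0, 1)},
}


def grad_log_prob(action, theta):
    """d/dtheta of log pi(action | theta) for the one-parameter policy."""
    return 1.0 / theta if action == 1 else -1.0 / (1.0 - theta)


def future_rewards(rewards):
    """Undiscounted sum of rewards from each step to the end of the game."""
    return [sum(rewards[t:]) for t in range(len(rewards))]


def step_contributions(actions, weights, theta):
    """Per-step terms of the gradient estimate: grad log-prob times weight."""
    return [grad_log_prob(a, theta) * w for a, w in zip(actions, weights)]


for name, game in GAMES.items():
    actions, rewards = game["actions"], game["rewards"]
    # Weight each step by its future reward.
    by_future = step_contributions(actions, future_rewards(rewards), THETA)
    # Weight every step by the total reward of the game.
    by_total = step_contributions(actions, [sum(rewards)] * len(rewards), THETA)
    print(name, by_future, sum(by_future), sum(by_total))
# Game 1 [4.0, -2.0, 2.0] 4.0 4.0
# Game 2 [2.0, -2.0, -2.0] -2.0 -2.0
# Game 3 [-4.0, 2.0, -2.0] -4.0 -4.0
```

In Game 3 the second and third contributions (2.0 and -2.0) sum to zero, the gradient is negative, and both weightings give -4.0.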